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APPROXIMATE NULL DISTRIBUTION OF THE LARGEST ROOT 
IN MULTIVARIATE ANALYSIS 1 

By Iain M. Johnstone 

Stanford University 

The greatest root distribution occurs everywhere in classical mul- 
tivariate analysis, but even under the null hypothesis the exact distri- 
bution has required extensive tables or special purpose software. We 
describe a simple approximation, based on the Tracy-Widom distri- 
bution, that in many cases can be used instead of tables or software, 
at least for initial screening. The quality of approximation is studied, 
and its use illustrated in a variety of setttings. 

1. Introduction. The greatest root distribution is found everywhere in 
classical multivariate analysis. It describes the null hypothesis distribution 
for the union intersection test for any number of classical problems, includ- 
ing multiple response linear regression, MANOVA, canonical correlations, 
equality of covariance matrices and so on. However, the exact null distribu- 
tion is difficult to calculate and work with, and so the use of extensive tables 
or special purpose software has always been necessary. 

This paper describes a simple asymptotic approximation, based on the 
Tracy Widom distribution. The approximation is not solely asymptotic; we 
argue that it is reasonably accurate over the entire range of the parameters. 
"Reasonably accurate" means, for example, less than ten percent relative 
error in the 95th percentile, even when working with two variables and any 
combination of error and hypothesis degrees of freedom. 

This paper focuses on the approximation, its accuracy and its applica- 
bility to a range of problems in multivariate analysis. A companion paper 
[Johnstone (2008)] contains all proofs and additional discussion. 

Our main claim is that for many applied purposes, the Tracy-Widom 
approximation can often, if not quite always, substitute for the elaborate 
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tables and computational procedures that have until now been needed. Our 
hope is that this paper might facilitate the use of the approximation in 
applications in conjunction with appropriate software. 

1.1. A textbook example. To briefly illustrate the Tracy- Widom approx- 
imation in action, we revisit the rootstock data, as discussed in Rencher 
(2002), pages 170-173. In a classical experiment carried out from 1918-1934, 
apple trees of different rootstocks were compared (Andrews and Herzberg 
[(1985), pages 357-360] has more detail). Rencher (2002) gives data for eight 
trees from each of six rootstocks. Four variables are measured for each tree: 
Girth4 = trunk girth at 4 years in mm, Growth4 = extension growth at 4 
years in m, Girthl5 = trunk girth at 15 years in mm, and Wtl5 = weight of 
tree above ground at 15 years in lb. 
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A one-way multivariate analysis of variance can be used to examine the 
hypothesis of equality of the four-dimensional vectors of mean values corre- 
sponding to each of the six groups (rootstocks) . The standard tests are based 
on the eigenvalues of (W + B) _1 B, where W and B are the sums of squares 
and products matrices within and between groups respectively. We focus 
here on the largest eigenvalue, with observed value 9 ohs = 0.652. Critical val- 
ues of the null distribution depend on parameters, here s = 4, m = 0, n = 18.5 
[using (8) below, along with the conventions of Section 5.1 and Definition 6]. 
Traditionally these are found by reference to tables or charts. Here, the 0.05 
critical value is found — after manual interpolation in those tables — to be 
#0.05 = 0.377. The approximation (6) of this paper yields the approximate 
0.05 critical value OqxH = 0.384, which clearly serves just as well for rejection 
of the null hypothesis. 

It is more difficult in standard packages to obtain p- values corresponding 
to 6 ohs . The default is to use a lower bound based on the F distribution 
[see (12)], here pp{9 ohs ) = 1.7 x 10 -8 , which is anti-conservative and several 
orders of magnitude below the Tracy-Widom approximation given in this 
paper at (11), PTw(# obs ) = 5.6 x 10" 5 . The latter is much closer to the 
formally correct value, 2 p(9 ohs ) = 3.7 x 10~ 6 . When p-values are very small, 



2 This (actually approximate) value is obtained by interpolation from Koev's function 
pmaxeigjacobi which only handles integer values of n. 
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typically only the order of magnitude is of interest. We suggest in Section 2.2 
that the Tracy- Widom approximation generally comes close to the correct 
order of magnitude, whereas the default F bound is often off by several 
orders. 

1.2. Organization of paper. The rest of this introduction provides enough 
background to state the main Tracy- Widom approximation result. Section 2 
focuses on the quality of the approximation by looking both at conventional 
percentiles and at very small p-values. The remaining Sections 3-6 describe 
some of the classical uses of the largest root test in multivariate analysis, 
in each case in enough detail to identify the parameters used. Some extra 
attention is paid in Section 6 to the multivariate linear model, in view of 
the wide variety of null hypotheses that can be considered. 

1.3. Background. Our setting is the distribution theory associated with 
sample draws from the multivariate normal distribution. For definiteness, 
we use the notation of Mardia, Kent and Bibby (1979), to which we also 
refer for much standard background material. Thus, if xi, . . . ,x n denotes a 
random sample from N p (fx, S), a p-variate Gaussian distribution with mean 
H and covariance matrix E, then we call the nxp matrix X = (xi, . . . , x n )', 
whose ith row contains the ith sample p-vector, a normal data matrix. 

A p x p matrix A that can be written A = X'X in terms of such a normal 
data matrix is said to have a Wishart distribution with scale matrix I] and 
degrees of freedom parameter n, A ~ W P (S, n). When p=l, this reduces to 
a scaled chi-squared law c 2 X^ n ) • 

We consider analogs of the F and Beta distributions of multivariate anal- 
ysis, which are based on two independent chi-squared variates. Thus, let 
A ~ Wp(£,m) be independent of B ~ W p (T,,n). If m >p, then A -1 exists 
and the nonzero eigenvalues of A _1 B are quantities of interest that general- 
ize the univariate F ratio. We remark that the scale matrix S has no effect 
on the distribution of these eigenvalues, and so, without loss of generality, 
we can suppose that S = I. 

The matrix analog of a Beta variate is based on the eigenvalues of (A + 
B) X B, and leads to the following: 

Definition 9 [Mardia, Kent and Bibby (1979), page 84]. Let A~ 
W p (l,m) be independent of B ~ W p (l,n), where m>p. Then the largest 
eigenvalue 9 of (A + B) _1 B is called the greatest root statistic and its dis- 
tribution is denoted 9{p,m,n). 

Since A is positive definite, we have < 6 < 1. Clearly 9(p,m,n) can also 
be defined as the largest root of the determinantal equation 



det[B-0(A + B)] =0. 
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Specific examples will be given below, but in general the parameter p refers 
to dimension, m to the "error" degrees of freedom and n to the "hypothesis" 
degrees of freedom. Thus, m + n represents the "total" degrees of freedom. 

There are min(n,p) nonzero eigenvalues of A _1 B or, equivalently, min(n,p) 
nonzero roots 9 = (9i) of the determinantal equation above. The joint density 
function of these roots is given by 

min(n,p) 

(1) p(9) = C \{ ^ |n - p| - 1)/2 (l-^) (m " p " 1)/2 AW, 

i=i 

where A(0) = Tii^j @j\ ( see ' e -S-> Muirhead [(1982), page 112], or Anderson 
[(2003), pages 536-537]). We shall not need the explicit form of the density 
in this paper; it is, however, useful sometimes in matching up the various 
parameter choices used in different references and packages. 
The greatest root distribution has the property 

9{p, m, n) = 9(n, m + n — p,p), 

useful, in particular, in the case when n < p [e.g. Mardia, Kent and Bibby 
(1979), page 84]. 

1.4. Main result. Empirical and theoretical investigation has shown that 
it is useful to develop the approximation in terms of the logit transform of 
9; thus, we define 

(9\D TTl Til 
i Z7 — 1 \ 
1 — 9{p, m, n) 

Our main result, stated more formally below, is that with appropriate 
centering and scaling, W is approximately Tracy-Widom distributed: 

W(p,m,n) -/j,(p,m,n) -n 

[6) r =^1. 

a{p, m, n) 

The centering and scaling parameters are defined by 

(4) fj,(p, m,n) = 2 log tan' 

(5) a 3 (p,m,n 



(m + n— l) 2 sin 2 ((f) + 7) sin cj) sin 7 ' 
where the angle parameters 7, eft are defined by 

2 (l\ _ min(p,n) - 1/2 



sin 



sin 2 1 — 



2 / m + n — 1 

max(p, n) — 1/2 



m + n — 1 
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Fig. 1. Density of the Tracy-Widom distribution F\. 

1.5. More on the Tracy-Widom law. The F\ distribution, due to 
Tracy and Widom (1996) and plotted in Figure 1, has its origins in mathe- 
matical physics — see Tracy and Widom (1996); Johnstone (2001) for further 
details. The density is asymmetric, with mean = —1.21 and SD = 1.27. Both 
tails have exponential decay, the left tail like e _ ' s Z 24 and the right tail like 

e -(2/3) S y\ 

For the present paper, what is important is that the F\ distribution does 
not depend on any parameters, and the distribution itself, along with its 
inverse and percentiles, can be tabulated as univariate special functions. 
These functions play the same role in this paper as the standard normal dis- 
tribution 3>, its inverse and percentiles z a play in traditional statistical 
application. 

Software. An R package RMTstat is available at CRAN (cran.r-project.org). 
It facilitates computation of the distributional approximations and largest 
root tests described in this paper, and the use of percentiles and random 
draws from the F\ distribution. Its scope and use is described in more de- 
tail in an accompanying report Johnstone et al. (2010). A parallel MATLAB 
package is in development; it will also contain code to reproduce the figures 
and table in this paper. 

Percentiles. Let f a denote the crth percentile of F\ . For example, 
/o. 90 = 0.4501, /0.95 = 0.9793, / . 99 = 2.0234. 
Then the ath percentile of 6(p,m,n) is given approximately by 
(6) 9 a = e^ +faa /(l + e^ +faa ), 

where fj, = fj,(p,m,n),a = a(p,m,n) are given by (4) and (5). 
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The more formal statement of (3) goes as follows. Assume p, m and n — > oo 
together in such a way that 

(7) lim — > 0, lim — > 1. 

m + n p 

For each sq S R, there exist c, C > such that for s > so, 

\P{W(p,m,n) < fi(p,m,n) + a(p,m,n)s} - Fi(s)\ < Cp~ 2 ^ 3 e~ cs . 

For the full proof and much more discussion and detail, see the companion 
paper [Johnstone (2008)]. 

Remarks. Smallest eigenvalue. If A and B are as in the definition of 
9(p, m, n), then let 9(p, m, n) denote the smallest eigenvalue of (A + B) _1 B. 
Its distribution is given by 

9(p, m, n) = 1 — 9(p, n, m), 

(note the reversal of m and n!). In particular, the Tracy-Widom distribution 
will give a generally useful approximation to the lower tail of 9(p,m,n). 

Complex-valued data. There is an entirely analagous result when A and 
B follow complex Wishart distributions, with a modified limit distribution 
F2. Details are given in Johnstone (2008). 

2. Quality of approximation. 

2.1. Comparison with percentiles. There is a substantial literature com- 
puting percentage points of the greatest root distribution for selected param- 
eter values, partially reviewed below. The standard paramaterization used 
in these tables arises from writing the joint density of the roots 9{ as 

s 

P (e) = cY[9r(i-e i ) n A(e). 

1=1 

From this and (1) it is apparent that our "MKB" parameters (p,m,n) are 
related to the "Table" parameters (s, m,n) via 

s = min(n,p), P = s, 

(8) m = (\n-p\ - l)/2, m = s + 2n + l, 

n = (m — p— l)/2, n = 5 + 2m + 1. 

In terms of the table parameters and N = 2(s + m + n)-|-l, the centering 
and scaling constants of the Tracy-Widom approximation are given by 

^(2)-(.-i)/N, 8 in>(!)4 + 2m + i)/N 
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and 



(9) 



f (t> + l\ 
/i = 2 log tan I — ^— 




We turn to the comparison of percentage points from the Tracy 



Widom approximation (6) with the exact values 9 a for small values of the 
table parameters (s, m, n). The most extensive tabulations of 9 a (s, m, n) have 
been made by William Chen; he has graciously provided the author with the 
complete version of the tables excerpted in Chen (2002, 2003, 2004a, 2004b). 

Figures 2 and 3 plot 6^ w against 6 a at 95th and 90th percentiles for s = 2. 
This is the smallest relevant value of s — otherwise we are in the univariate 
case covered by F distributions. The bottom panels, in particular, focus on 
the relative error 



Figure 2 shows that even for s = 2, the 95th percentile of the TW approx- 
imation has a relative error of less than 1 in 20 except in the zone where 
both m < 2 and n > 10, where the relative error is still less than 1 in 10. 
Note that the relative error is always positive in sign, implying that the ap- 
proximate critical points yield a conservative test. More extensive contour 
plots covering s = 2(1)6 and 90th, 95th and 99th percentiles may be found 
in Johnstone and Chen (2007). 

Work on tables. There has been a large amount of work to prepare ta- 
bles or charts for the null distribution of the largest root, much of which 
is reviewed in Chen (2003). We mention contributions by the following: 
Nanda (1948, 1951); Foster and Rees (1957); Foster (1957, 1958); Pillai 
(1955, 1956a, 1956b, 1957, 1965, 1967); Pillai and Bantegui (1959); Heck 
(I960); Krishnaiah (1980); Pillai and Flury (1984); Chen (2002, 2003, 2004a, 
2004b). 

Because of the dependence on the three parameters, these tables can run 
up to 25 pages in typical textbooks, such as those of Johnson and Wichern 
(2002) and Morrison (2005). 

Code. Constantine (1963) expresses the c.d.f. of the largest root distri- 
bution in terms of a matrix hypergeometric function. Koev and Edelman 
(2006) have developed efficient algorithms (and a MATLAB package avail- 
able at http://www-math.mit.edu/~plamen) for the evaluation of such ma- 
trix hypergeometric functions using recursion formulas from group represen- 
tation theory. 

Koev (2010) collects useful formulas and explains how to use them and mhg 
to compute the exact c.d.f. and percentiles for the largest root distribution 
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Fig. 2. Comparison of exact and approximate 95th percentiles for s = 2. Top panel: solid 
line is the Tracy-Widom approximation #^ w (2, m, n) plotted as a function of m for values 
of n shown. Dashed lines are the exact percentiles # a (2,m,n) from Chen's tables. Bottom 
panel: Contour plots of relative error r = (8& w /9 a ) — 1. Horizontal axis is m, vertical axis 
is log 10 n, thus covering the range from n = 1 to 1000. 
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m 

Fig. 3. Comparison of exact and approximate 90th percentiles for s = 4. Top panel: solid 
line is the Tracy-Widom approximation #J W (4, m, n) plotted as a function of m for values 
of n shown. Dashed lines are the exact percentiles 9 a (A, m,n) from Chen's tables. Bottom 
panel: Contour plots of relative error r = (#q W /9 a ) — 1. Horizontal axis is m, vertical axis 
is log 10 n, thus covering the range from n = 1 to 1000. 
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over a range of values of the "MKB" parameters corresponding to m,n,p< 
17, and m, n,p < 40 when n — p is odd. 

SAS/STAT 9.0 made available an option for computing exact p- values 
using Davis (1972); Pillai and Flury (1984). There is also some stand-alone 
software described by Lutz (1992, 2000). 

2.2. Accuracy of p-values. The univariate F bound. We recall the hy- 
pothesis that A ~ W p (I,m) be distributed independently of B ~ W p (l,n), 
and the characterization of the largest eigenvalue given by 

, . i~s u'Bu 
10 A max A^B = max—. 

|u|=i u'Au 

For fixed u of unit length, the numerator and denominator are distributed 
as independent xf n ^ an d X( m ) respectively, and so, again for fixed u, the 
ratio has an F nm distribution. Consequently, we have the simple bound 

— A m ax(A _1 B) > F ~ F n m . 
n 

Using the F n ^ m distribution in place of the actual greatest root law yields 
a lower bound for the significance level, or p-value. We shall see that this 
bound can be anti-conservative by several orders of magnitude, leading to 
overstatements of the empirical evidence against the null hypothesis. And 
furthermore, one can expect that the higher the dimension p of the search 
space in (10), the worse the bound provided by the F distribution. 

The default p- value provided in both SAS and R (through package car) 
uses this unsatisfactory distribution bound. 

Table 1 attempts to capture a variety of scenarios within the computa- 
tional range of Koev's software. 

Column Exact shows a range of significance levels a covering several 
orders of magnitude. Column Largest Root shows the corresponding quan- 
tiles a of the largest root distribution, for the given values of (s, m, n) — these 
are computed using Koev's MATLAB routine qmaxeigjacobi. Thus, an ob- 
served value of 0(s, m, n) — 9 a would correspond to an exact p- value ex. 

The remaining columns compare the Tracy-Widom approximation and 
the F bound. The p-value obtained from the Tracy-Widom approximation 
is given by 

(11) Prw(0a) = 1 - *i((logit(0 a ) - n)/a), 

where \x and a are computed from (9). 
The F bound on the p- value is given by 

(12) P(9(s, m, n) > a ) > P F {6 a ) = 1 - F vl>V2 {u 2 6 a /{v x {l ~ 0a))), 

where v\ = s + 2m + 1 and v 2 = s + 2n + 1 denote the hypothesis and error 
degrees of freedom respectively. 
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The two tables consider s = 2 and 6 variables respectively. The values of 
m = —0.5 and 5 correspond to s and s + 11 hypothesis degrees of freedom, 
while the values of n = 2 and 10 translate to s + 5 and s + 21 error degrees 
of freedom respectively. 

At the 10% and 5% levels, the Tracy- Widom approximation is within 20% 
of the true p- value at s = 6, and within 35% of truth at s = 2. The F- value is 
wrong by a factor of four or more at s = 2, and by three orders of magnitude 
at s = 6. At smaller significance levels, the Tracy- Widom approximation 
generally stays within one order of magnitude of the correct p- value — except 
at (s, m,n) = (2,-0.5,10). The F approximation is off by many orders of 
magnitude when s = 6. 

In addition, we note that the Tracy-Widom approximation is conser- 
vative in nearly all cases, the exception being for 9 > 0.985 in the case 
(s, m,n) = (6,-0.5,2). In contrast, the F approximation is always [cf. (12)] 
anti-conservative, often badly so. 

In applications one is often concerned only with the general order of mag- 
nitude of the p-values associated with tests of the various hypotheses that 
are entertained — not least because the assumptions of the underlying model 
are at best approximately true. For this purpose, then, it may be argued 
that the TW approximate p-value is often quite adequate over the range of 
(s, m,n) values. Of course, if (s, m,n) is not too large and greater precision 
is required, then exact p-values can be sought, using, for example, SAS or 
Koev's software. 

3. Testing for independence of two sets of variables. Let xi, . . . , x n be a 

random sample from N p (fj,, S). Partition the variables into two sets with di- 
mensions p\ and p2 respectively, p\ +P2 = P- Suppose that 5] and the sample 
covariance matrix S are partitioned correspondingly. We consider testing the 
null hypothesis of independence of the two sets of variables: E12 = 0. The 
union-intersection test is based on the largest eigenvalue Ai of S^ 2 1 S2iS| L1 1 Si2 
(Mardia, Kent and Bibby (1979), page 136) and under Hq,X\ has the great- 
est root distribution 6(p2,n — 1 — pi,p±). Mardia, Kent and Bibby (1979) 
consider an example test of independence of n = 25 head length and breadth 
measurements between first sons and second sons, so that p\ = P2 = 2. 
The observed value A° bs = 0.622 exceeds the critical value #0.05 = 0.330 
found by interpolation from the tables. The Tracy-Widom approximation 
^005 = 0-356 is found from (6) and serves equally well for rejection of Hq in 
this case. 

4. Canonical correlation analysis. Again we have two sets of variables, 
an x-set with q variables and a y-set with p variables. The goal is to find 
maximally correlated linear combinations r] = a'x and <j) = b'y. We suppose 
that (X, Y) is a data matrix of n samples (rows) on q+p variables (columns) 
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Table 1 

Comparison of the Tracy- Widom approximation and F bound for cases with s = 2 and 

s = 6 variables 
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6.88e-006 


7 
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13e-011 



such that each row is an independent draw from N p+q (fj,, £). Again let S be 
the sample covariance matrix, assumed partitioned S = ( g 1 1 s ^ ) . The sam- 
ple squared canonical correlations (rf) for i = 1, . . . ,k = min(p, q) are found 
as the eigenvalues of M2 = S^ 1 S2iS i ~ 1 1 Si2 [Mardia, Kent and Bibby (1979), 
Sections 10.2.1 and 10.2.2]. The population squared canonical correlations 
pf are, in turn, the eigenvalues of £32 ^2±£;ii ^12- ^ n both cases, we assume 
that the correlations are arranged in decreasing order. The test of the null 
hypothesis of zero correlation, Hq : p\ = ■ ■ ■ = pk = 0, is based on the largest 
eigenvalue r\ of M2. Under Ho, it is known that r\ has the 6(p,n — q — l,q) 
distribution, so that the Tracy-Widom approximation can be applied. 

Nonnull cases — a conservative test. Often it may be apparent that the 
first k canonical correlations are nonzero and the main interest focuses on 
the significance of r| +1 ,r| +2 , etc. We let H s denote the null hypothesis that 
p s+ i = ■ ■ ■ = p p = 0, and write C(rk\p, q, n; 5]) for the distribution of the rth 
c.c. under population covariance matrix S. When the covariance matrix 
S € H s , the (s + l)st canonical correlation is stochastically smaller than the 
largest canonical correlation in a related null model: 
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Lemma 1. IfSeH s , then 

st 

C(r s+ i \p, q, n; S) < C{r x \p, q - s, n; I) . 

This nonasymptotic result follows from interlacing properties of the sin- 
gular value decomposition (Appendix). Since L{r\ \p, q — s,n;I) is given by 
the null distribution 8(p, n + s — q — 1, q — s) , we may use the latter to provide 
a conservative p- value for testing H s . In turn, the p- value for 9(p,n + s — q — 
l,q — s) can be numerically approximated as in (6) using the Tracy- Widom 
distribution. 

Example. Waugh (1942) gave perhaps the first significant illustration of 
CCA using data on n = 138 samples of Canadian Hard Red Spring wheat 
and the flour made from each of these samples. The aim was to seek highly 
correlated indices a'x of wheat quality and b'y of flour quality, since a 
well correlated grading of raw (wheat) and finished (flour) products was 
believed to promote fair pricing of each. In all, q = 5 wheat characteristics — 
kernel texture, test weight, damaged kernels, foreign material, crude protein 
in wheat — and p = 4 flour characteristics — wheat per bushel of flour, ash 
in flour, crude protein in flour, gluten quality index — were measured. The 
resulting squared canonical correlations were (^i,r|,r|,r|) = (0.923,0.554, 
0.056,0.008). The leading correlation would seem clearly significant and, 
indeed, from our approximate formula (6), #^99 = 0.184. 

To assess the second correlation r2, we appeal to the conservative test 
discussed above based on the null distribution with q — 1 = 4,p = 4 and n = 
138. The Tracy-Widom approximation 0™ ~ ^ + 2.023ct = 0.166 < 0.554, 
which strongly suggests that this second correlation is significant as well. 

Marginal histograms naturally reveal some departures from symmetric 
Gaussian tails, but they do not seem extreme enough to invalidate the con- 
clusions, which are also confirmed by permutation tests. 

5. Tests of common means or variances. 

5.1. Equality of means for common covariance. Suppose that we have 
k populations with independent data matrices Xj consisting of ni observa- 
tions drawn from an N p (fj, i ,'Si) and put n = X] n i- This is the one-way 
multivariate analysis of variance illustrated in Example 1.1. For testing 
the null hypothesis of equality of means Hq:^ = ■■■ = fi k , we form, for 
each population, the sample mean Xj and covariance matrix Sj, normalized 
so that riiSi ~ W p (Ti,ni — 1). The basic quantities are the within groups 
sum of squares W = ^ n^Sj ~ W p (£, n — k) and the between group sum of 
squares B = ^ nj(xj — x)(xj — x)' ~ W p (T,,k — 1) under Hq, independently 
of W. The union-intersection test of Hq uses the largest root of W _1 B 
or, equivalently, that of (W + B) _1 B, and the latter has, under Hq, the 
9(p, n — k, k — 1) distribution. 
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5.2. Equality of covariance matrices. Suppose that independent samples 
from two normal distributions N p (fi 1 , Ei) and N p (fj, 2 , ^2) lead to covariance 
estimates S.; which are independent and Wishart distributed on n« degrees 
of freedom: n^Sj ~ W p (rii, Xlj) for i = 1,2. Then the largest root test of the 
null hypothesis Hq : £1 = E2 is based on the largest eigenvalue 9 of (niSi + 
re 2£>2)~ 1 ra2S2, which under Hq has the 9(p,ni,ri2) distribution Muirhead 
(1982), page 332. 

6. Multivariate linear model. The multivariate linear model blends ideas 
well known from the univariate setting with new elements introduced by 
correlated multiple responses. In view of the breadth of models covered, and 
the variety of notation in the literature and in the software, we review the 
setting in a little more detail, beginning with the familiar model for a single 
response 

y = X/3 + u. 

Here y is an n x 1 column vector of observations on a response variable, X 
is an n x q model matrix, and u is an n x 1 column vector of errors, assumed 
here to be independent and identically distributed as N(0,a 2 ). The q x 1 
vector (3 of unknown parameters has the least squares estimate — when X 
has full rank — given by 

^=(X'X)- 1 X'y. 

The error sum of squares SSe = (y — X/3)'(y — X/3) = y'Py, where P 
denotes orthogonal projection onto the subspace orthogonal to the columns 
of X, it has rank n — q, and so SSe ~ xf n - q )- 

Consider the linear hypothesis Hq : Ci/3 = 0, where Ci is a g x q matrix of 
rank g. In the simplest example, Ci = [I g 0] extracts the first g elements of 
/3; more generally, the rows of Ci are often contrasts among the components 
of j3. To describe the standard F-test of Hq, let C2 be any (q — g) x q matrix 
such that C = b ecomes an invertible q x q matrix. We may then write 

X/3=[XC( 1 ) XC( 2 )](^), 

where we have partitioned C _1 = [C^ C^ 2 )] into blocks with g and q — g 
columns respectively. 

Let Pi denote the orthogonal projection onto the subspace orthogonal to 
the columns of XC^. We have the sum of squares decomposition 

y / P 1 y = y / Py + y / (P 1 -P)y 

and the hypothesis sum of squares for testing Hq : Ci/3 = is given by 
SSh = y'P2Y) with P2 = Pi — P. The projection P2 has rank g and so 
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under Hq, SSh ~ xf g y The projections P and P2 are orthogonal and so the 
sums of squares have independent chi-squared distributions, and under Hq 
the traditional F-statistic 

*~ SS E /(n-q) t9 " l ~ q - 
Explicit expressions for the sums of squares are given by 
55 s = y'(I-X(X'X)- 1 X / )y, 

SS H = (C 1 0)'[C 1 (X , X)- 1 C 1 ]- 1 C 1 p. 
In the multivariate linear model, 

Y = XB + U, 

the single response y is replaced by p response vectors, organized as columns 
of the nxp matrix Y. The model (or design) matrix X remains the same for 
each response; however, there are separate vectors of unknown coefficients 
and errors for each response; these are organized into a q x p matrix B of re- 
gression coefficients and an nxp matrix E of errors. The multivariate aspect 
of the model is the assumption that the rows of U are indepedent, with mul- 
tivariate normal distribution having mean and common covariance matrix 
S. Thus, U is a normal data matrix of n samples from N p (0, S). Assuming 
for now that the model matrix X has full rank, the least squares estimator 

B = (X'X) -1 X'Y. 

The linear hypothesis becomes 

H : CiB = 0. 

The sums of squares of the univariate case are replaced by hypothesis and 
error sums of squares and products matrices: 

E = Y'PY = Y'(I - X(X'X)~ 1 X')Y, 

(13) 

H = Y'P 2 Y = (CiB/fC^X'X^Cil^CiB, 

in which the univariate vectors y and f3 are simply replaced by their multi- 
variate analogs Y and B. It follows that E ~ W p (S,n — q) and that under 
Hq, H ~ W p (E,g); furthermore, E and H are independent. Generalizations 
of the -F-test are obtained from the eigenvalues (Aj) of the matrix E _1 H or, 
equivalently, the eigenvalues 6i of (H + E) _1 H. 

Thus, under the null hypothesis CiB = 0, Roy's maximum root statistic 
6\ has null distribution 

9\ ~ 9(p,n — q, g), where 

(14) p = dimension, g = rank(Ci), 

q = rank(X) , n = sample size. 
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Two extensions, (a) X not of full rank. This situation routinely occurs 
when redundant parameterizations are used, for example, when dealing with 
factors in analysis of variance models. One approach (e.g., MKB, Section 6.4) 
is to rearrange the columns of X and partition X = [Xi X2] so that Xi has 
full rank. We must also assume that the matrix Ci is testable in the sense 
that, as a function of B, XB = implies CiB = 0. In such cases, if we par- 
tition Ci = [Cn C12] conformally with X, then C12 = Cn(X / 1 Xi)~ 1 X / 1 X2 
is determined from Cxi- 

With these assumptions, we use Xi and Cn in (13) and (14) in place of 
X and Ci. 

(b) Intra-subject hypotheses. A straightforward extension is possible in 
order to test null hypotheses of the form 

C1BM1 =0, 

where Mi is p x r of rank r. The columns of Mi capture particular linear 
combinations of the dependent variables — for an example, see, e.g., Morrison 
(2005), Chapter 3.6. 

We simply consider a modified linear model 

YMi = XBMi + UMi. 

An important point is that the rows of UMi are still independent, now 
distributed as N p (0, M^SM). So we may simply apply the above analysis, 
replacing Y, U and B by YMi, UM] and BMi respectively. In particular, 
the greatest root statistic now has null distribution given by 

01 ~9(r,n-q,g). 

Linear hypotheses in SAS. Analyses involving the four multivariate tests 
are provided in a number of SAS routines, such as GLM and CANCORR. 
The parameterization used here can be translated into that used in SAS by 
means of the documentation given in the SAS/STAT Users Guide — we refer 
to the section on Multivariate Tests in version 9.1, page 48 ff. The linear 
hypotheses correspond to MKB notation via 

MKB SAS 
~C[ L 
B (3 
Mi M 

while the parameters of the greatest root distribution are given by 





MKB 


SAS 


dimension 


r 


rank(Mi) 


rank(M) 


P 


hypothesis 


9 


rank(Ci) 


rank(L) 


q 


error 


n — 


1 




V 
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(Note: we use sans serif font for the SAS parameters!) Finally, the SAS 
printouts use the following parameters: 

s = p Aq, 

m = (|p-q|-l)/2, 
n = (v-p-l)/2. 

7. Concluding discussion. We have described the Tracy-Widom approx- 
imation to the null distribution for the largest root test for a variety of clas- 
sical multivariate procedures. These procedures exhibit varying degrees of 
sensitivity to the assumption of normality, independence etc. Documenting 
the sensitivity /robustness of the T-W approximation is clearly an important 
issue for further work. Two brief remarks can be made. In the correspond- 
ing single Wishart setting [e.g., Johnstone (2001)], the largest eigenvalue 
can be shown, under the null distribution, to still have the T-W limit if the 
original data have "light tails" (i.e., sub-Gaussian) [see Soshnikov (2002); 
Peche (2009)]. In the double Wishart settings, simulations for canonical cor- 
relation analysis with n = 100 samples on q = 20 and p = 10 variables, each 
following i.i.d. trg\ or i.i.d. random sign distributions, showed that the T-W 
distribution for the leading correlation r\ still holds in the central 99% of 
the distribution. 

APPENDIX: PROOF OF LEMMA 

If £ £ H s , there are at most s nonzero canonical correlations, and we 
may suppose without loss of generality that the q x-variables have been 
transformed so that only the last s of them have any correlation with Y. 
We employ the singular value decomposition (SVD) description of CCA, 
cf. Golub and Van Loan (1996), Section 12.4.3. Using QR decompositions, 
write 

X = Q X R X , Y = Q Y R Y . 

Let C = Q x Qy an d form the SVD C = URV T . Then the diagonal elements 
fi >r2 > ■ ■ ■ > T m in(p,(?) of R contain the sample canonical correlations. 

Now consider the reduced n x (q — s) matrix X - obtained by dropping the 
last s columns from X. Form the QR decomposition X~ = Q x -R x - . From 
the nature of the decomposition, we have Q x = [Q x - Q + ], that is, Q x - 
represents the first q — s columns of Q x . Consequently, C_ = Q x -Qy forms 
the first q — s rows of C. Our lemma now follows from the interlacing property 
of singular values [e.g., Golub and Van Loan (1996), Corollary 8.6.3]. 

<7 s+1 (C)<<n(C_). 

Indeed, our earlier discussion implies that X - and Y are independent, and 
so <7i(C_) has the null distribution C{r\\p,q — s,n;T). 
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